Gene ranking and biomarker discovery under correlation
Biomarker discovery and gene ranking are standard tasks in genomic
high-throughput analysis. Typically, the ordering of markers is based on a
stabilized variant of the t-score, such as the moderated t or the SAM
statistic. However, these procedures ignore gene-gene correlations, which may
have a profound impact on the gene orderings and on the power of the subsequent
tests.
We propose a simple procedure that adjusts gene-wise t-statistics to take
account of correlations among genes. The resulting correlation-adjusted
t-scores ("cat" scores) are derived from a predictive perspective, i.e. as a
score for variable selection to discriminate group membership in two-class
linear discriminant analysis. In the absence of correlation the cat score
reduces to the standard t-score. Moreover, using the cat score it is
straightforward to evaluate groups of features (i.e. gene sets). For
computation of the cat score from small sample data we propose a shrinkage
procedure. In a comparative study comprising six different synthetic and
empirical correlation structures we show that the cat score improves estimation
of gene orderings and leads to higher power for fixed true discovery rate, and
vice versa. Finally, we also illustrate the cat score by analyzing metabolomic
data.
The shrinkage cat score is implemented in the R package "st" available from
URL http://cran.r-project.org/web/packages/st/
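As a rough illustration of the decorrelation step, the sketch below (Python; the function name, the toy data, and the simple identity-blend shrinkage are illustrative assumptions, not the estimator used by "st") computes correlation-adjusted t-scores as R^{-1/2} t from two groups of samples.

    # Illustrative sketch of the cat score: decorrelate ordinary t-scores with
    # the inverse square root of a (shrunken) gene-gene correlation matrix.
    # The identity-blend shrinkage below is a stand-in, not the "st" estimator.
    import numpy as np

    def cat_scores(X1, X2, shrink=0.2):
        n1, n2 = len(X1), len(X2)
        # pooled variance and ordinary t-scores, gene by gene
        v = ((n1 - 1) * X1.var(axis=0, ddof=1) + (n2 - 1) * X2.var(axis=0, ddof=1)) / (n1 + n2 - 2)
        t = (X1.mean(axis=0) - X2.mean(axis=0)) / np.sqrt(v * (1 / n1 + 1 / n2))
        # shrunken correlation matrix: blend the empirical R toward the identity
        R = np.corrcoef(np.vstack([X1, X2]), rowvar=False)
        R = (1 - shrink) * R + shrink * np.eye(R.shape[0])
        # cat score = R^{-1/2} t; equals t whenever R is the identity
        evals, evecs = np.linalg.eigh(R)
        return evecs @ np.diag(evals ** -0.5) @ evecs.T @ t

    rng = np.random.default_rng(0)
    X1 = rng.normal(size=(10, 5))             # group 1: 10 samples, 5 genes
    X2 = rng.normal(loc=0.5, size=(12, 5))    # group 2: shifted means
    print(cat_scores(X1, X2))

Setting shrink=1 forces R to the identity, in which case the returned scores coincide with the ordinary t-scores, matching the reduction noted above.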
A Multivariate Framework for Variable Selection and Identification of Biomarkers in High-Dimensional Omics Data
In this thesis, we address the identification of biomarkers in high-dimensional omics data. The identification of valid biomarkers is especially relevant for personalized medicine, which depends on accurate prediction rules. Moreover, biomarkers elucidate the provenance of disease, or molecular changes related to disease. From a statistical point of view, the identification of biomarkers is best cast as variable selection. In particular, we refer to variables as the molecular attributes under investigation, e.g. genes, genetic variation, or metabolites; and we refer to observations as the specific samples whose attributes we investigate, e.g. patients and controls. Variable selection in high-dimensional omics data is a complicated challenge due to the characteristic structure of omics data. For one, omics data is high-dimensional, comprising cellular information in unprecedented detail. Moreover, there is an intricate correlation structure among the variables due to, e.g., internal cellular regulation or external, latent factors. Variable selection for uncorrelated data is well established. In contrast, there is no consensus on how to approach variable selection under correlation.
Here, we introduce a multivariate framework for variable selection that explicitly accounts for the correlation among markers. In particular, we present two novel quantities for variable importance: the correlation-adjusted t (CAT) score for classification, and the correlation-adjusted (marginal) correlation (CAR) score for regression. The CAT score is defined as the Mahalanobis-decorrelated t-score vector, and the CAR score as the Mahalanobis-decorrelated correlation between the predictor variables and the outcome. We derive the CAT and CAR score from a predictive point of view in linear discriminant analysis and regression; both quantities assess the weight of a decorrelated and standardized variable on the prediction rule. Furthermore, we discuss properties of both scores and relations to established quantities. Above all, the CAT score decomposes Hotelling's T² and the CAR score the proportion of variance explained. Notably, the decomposition of total variance into explained and unexplained variance in the linear model can be rewritten in terms of CAR scores.
To render our approach applicable to high-dimensional omics data, we devise an efficient algorithm for shrinkage estimation of the CAT and CAR scores. Subsequently, we conduct extensive simulation studies to investigate the performance of our novel approaches in ranking and prediction under correlation. Here, CAT and CAR scores consistently improve over marginal approaches, selecting more true positives and attaining lower model error. Finally, we illustrate the application of CAT and CAR scores on real omics data. In particular, we analyze genomics, transcriptomics, and metabolomics data. We ascertain that CAT and CAR scores are competitive with or outperform state-of-the-art techniques in terms of true positives detected and prediction error.
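To make the CAR definition concrete, a small sketch follows (plain empirical estimates and the function name car_scores are assumptions; the thesis relies on shrinkage estimators when the number of variables exceeds the sample size): the CAR scores are the marginal correlations with the outcome, decorrelated by the inverse square root of the predictor correlation matrix, and their squares sum to the sample proportion of variance explained.

    # Sketch of the CAR score: Mahalanobis-decorrelated marginal correlations
    # between standardized predictors and the outcome. Plain empirical
    # estimates are used here; shrinkage would be needed when p >> n.
    import numpy as np

    def car_scores(X, y):
        n = len(y)
        Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
        ys = (y - y.mean()) / y.std(ddof=1)
        R = np.corrcoef(Xs, rowvar=False)     # predictor correlation matrix
        r_xy = Xs.T @ ys / (n - 1)            # marginal correlations with y
        evals, evecs = np.linalg.eigh(R)
        return evecs @ np.diag(evals ** -0.5) @ evecs.T @ r_xy

    rng = np.random.default_rng(1)
    X = rng.normal(size=(50, 4))
    y = X @ np.array([1.0, 0.5, 0.0, 0.0]) + rng.normal(size=50)
    omega = car_scores(X, y)
    print(omega, (omega ** 2).sum())          # squared CAR scores sum to the sample R^2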
Inferring Causal Relationships Between Risk Factors and Outcomes from Genome-Wide Association Study Data.
An observational correlation between a suspected risk factor and an outcome does not necessarily imply that interventions on levels of the risk factor will have a causal impact on the outcome (correlation is not causation). If genetic variants associated with the risk factor are also associated with the outcome, then this increases the plausibility that the risk factor is a causal determinant of the outcome. However, if the genetic variants in the analysis do not have a specific biological link to the risk factor, then causal claims can be spurious. We review the Mendelian randomization paradigm for making causal inferences using genetic variants. We consider monogenic analysis, in which genetic variants are taken from a single gene region, and polygenic analysis, which includes variants from multiple regions. We focus on answering two questions: When can Mendelian randomization be used to make reliable causal inferences, and when can it be used to make relevant causal inferences?
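As background for the polygenic case, the following minimal sketch shows the textbook building blocks of two-sample summary-data Mendelian randomization, the per-variant Wald ratio and their inverse-variance weighted (IVW) combination; the variable names and toy numbers are purely illustrative and are not taken from the review.

    # Minimal two-sample summary-data MR: per-variant Wald ratios and their
    # first-order inverse-variance weighted (IVW) combination.
    import numpy as np

    def wald_ratios(beta_x, beta_y):
        return beta_y / beta_x                      # per-variant causal estimate

    def ivw_estimate(beta_x, beta_y, se_y):
        w = beta_x ** 2 / se_y ** 2                 # first-order IVW weights
        theta = np.sum(w * wald_ratios(beta_x, beta_y)) / np.sum(w)
        se = np.sqrt(1.0 / np.sum(w))
        return theta, se

    beta_x = np.array([0.12, 0.08, 0.15])           # variant-risk factor associations
    beta_y = np.array([0.06, 0.05, 0.07])           # variant-outcome associations
    se_y = np.array([0.02, 0.02, 0.03])             # standard errors of beta_y
    print(ivw_estimate(beta_x, beta_y, se_y))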
Task-related edge density (TED) - a new method for revealing large-scale network formation in fMRI data of the human brain
The formation of transient networks in response to external stimuli or as a
reflection of internal cognitive processes is a hallmark of human brain
function. However, its identification in fMRI data of the human brain is
notoriously difficult. Here we propose a new method of fMRI data analysis that
tackles this problem by considering large-scale, task-related synchronisation
networks. Networks consist of nodes and edges connecting them, where nodes
correspond to voxels in fMRI data, and the weight of an edge is determined via
task-related changes in dynamic synchronisation between their respective time
series. Based on these definitions, we developed a new data analysis algorithm
that identifies edges in a brain network that differentially respond in unison
to a task onset and that occur in dense packs with similar characteristics.
Hence, we call this approach "Task-related Edge Density" (TED). TED proved to
be a very strong marker for dynamic network formation that easily lends itself
to statistical analysis using large-scale statistical inference. A major
advantage of TED compared to other methods is that it does not depend on any
specific hemodynamic response model, and it also does not require a
presegmentation of the data for dimensionality reduction as it can handle large
networks consisting of tens of thousands of voxels. We applied TED to fMRI data
of a fingertapping task provided by the Human Connectome Project. TED revealed
network-based involvement of a large number of brain areas that evaded
detection using traditional GLM-based analysis. We show that our proposed
method provides an entirely new window into the immense complexity of human
brain function.
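A toy sketch of the edge-weighting idea follows: the synchronisation between two voxel time series is tracked in sliding windows and scored against the task design. The window length, the Pearson measure of synchronisation, and the boxcar task regressor are simplifying assumptions, not the published TED algorithm.

    # Toy edge-weighting sketch: windowed correlation between two voxel time
    # series, then how strongly that synchronisation follows the task design.
    import numpy as np

    def edge_task_score(ts_a, ts_b, task, win=10):
        # dynamic synchronisation: correlation inside each sliding window
        sync = np.array([
            np.corrcoef(ts_a[i:i + win], ts_b[i:i + win])[0, 1]
            for i in range(len(ts_a) - win + 1)
        ])
        task_w = task[:len(sync)]                 # align task regressor to windows
        return np.corrcoef(sync, task_w)[0, 1]    # task-relatedness of this edge

    rng = np.random.default_rng(2)
    T = 200
    task = np.tile(np.r_[np.zeros(20), np.ones(20)], 5)   # boxcar task design
    shared = task * rng.normal(1.0, 0.2, T)                # task-driven shared signal
    ts_a = shared + rng.normal(size=T)
    ts_b = shared + rng.normal(size=T)
    print(edge_task_score(ts_a, ts_b, task))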
Selection of invalid instruments can improve estimation in Mendelian randomization
Mendelian randomization (MR) is a widely-used method to identify causal links
between a risk factor and disease. A fundamental part of any MR analysis is to
choose appropriate genetic variants as instrumental variables. Current practice
usually involves selecting only those genetic variants that are deemed to
satisfy certain exclusion restrictions, in a bid to remove bias from unobserved
confounding. Many more genetic variants may violate these exclusion
restrictions due to unknown pleiotropic effects (i.e. direct effects on the
outcome not via the exposure), but their inclusion could increase the precision
of causal effect estimates at the cost of allowing some bias. We explore how to
optimally tackle this bias-variance trade-off by carefully choosing from many
weak and locally invalid instruments. Specifically, we study a focused
instrument selection approach for publicly available two-sample summary data on
genetic associations, whereby genetic variants are selected on the basis of how
they impact the asymptotic mean square error of causal effect estimates. We
show how different restrictions on the nature of pleiotropic effects have
important implications for the quality of post-selection inferences. In
particular, a focused selection approach under systematic pleiotropy allows for
consistent model selection, but in practice can be susceptible to winner's
curse biases. A more general form of idiosyncratic pleiotropy, in contrast, allows
only conservative model selection, but offers uniformly valid confidence
intervals. We propose a novel method to tighten honest confidence intervals
through support restrictions on pleiotropy. We apply our results to several
real data examples, which suggest that the optimal selection of instruments
involves not only biologically justified valid instruments but also hundreds of
potentially pleiotropic variants.
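The bias-variance trade-off at the heart of this approach can be caricatured with a small greedy sketch: starting from a core of instruments assumed valid, a candidate variant is added only when a plug-in proxy for the squared bias it introduces is smaller than the variance it removes. The acceptance rule, the simple IVW estimator, and all numbers below are illustrative assumptions rather than the paper's focused selection criterion.

    # Caricature of bias-variance instrument selection: accept a candidate
    # variant when its estimated squared-bias contribution to the pooled IVW
    # estimate is outweighed by the variance reduction it brings.
    import numpy as np

    def ivw(bx, by, sy):
        w = bx ** 2 / sy ** 2
        return np.sum(w * by / bx) / np.sum(w), 1.0 / np.sum(w)   # estimate, variance

    def greedy_selection(core, candidates):
        selected = list(core)
        bx, by, sy = map(np.array, zip(*selected))
        theta, var = ivw(bx, by, sy)
        for cx, cy, csy in candidates:
            alpha_hat = cy - theta * cx                  # apparent pleiotropic (direct) effect
            trial = selected + [(cx, cy, csy)]
            bx, by, sy = map(np.array, zip(*trial))
            theta_trial, var_trial = ivw(bx, by, sy)
            w_c = cx ** 2 / csy ** 2
            bias = w_c * (alpha_hat / cx) * var_trial    # approx. bias the candidate adds
            if bias ** 2 < var - var_trial:              # squared bias vs variance saved
                selected, theta, var = trial, theta_trial, var_trial
        return selected, theta

    core = [(0.10, 0.050, 0.02), (0.12, 0.055, 0.02)]    # (beta_x, beta_y, se_y) per variant
    candidates = [(0.05, 0.030, 0.02),                   # weak but roughly consistent: accepted
                  (0.06, 0.100, 0.02)]                   # strongly pleiotropic: rejected
    print(greedy_selection(core, candidates))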
Modal-based estimation via heterogeneity-penalized weighting: model averaging for consistent and efficient estimation in Mendelian randomization when a plurality of candidate instruments are valid.
BACKGROUND: A robust method for Mendelian randomization does not require all genetic variants to be valid instruments to give consistent estimates of a causal parameter. Several such methods have been developed, including a mode-based estimation method giving consistent estimates if a plurality of genetic variants are valid instruments; i.e. no subset of invalid instruments estimating the same causal parameter is larger than the subset of valid instruments. METHODS: Here we develop a model-averaging method that gives consistent estimates under the same 'plurality of valid instruments' assumption. The method considers a mixture distribution of estimates derived from each subset of genetic variants. The estimates are weighted such that subsets with more genetic variants receive more weight, unless variants in the subset have heterogeneous causal estimates, in which case that subset is severely down-weighted. The mode of this mixture distribution is the causal estimate. This heterogeneity-penalized model-averaging method has several technical advantages over the previously proposed mode-based estimation method. RESULTS: The heterogeneity-penalized model-averaging method outperformed the mode-based estimation method in terms of efficiency and outperformed other robust methods in terms of Type 1 error rate in an extensive simulation analysis. In an applied analysis, the proposed method suggests two distinct mechanisms by which inflammation affects coronary heart disease risk, with subsets of variants suggesting both positive and negative causal effects. CONCLUSIONS: The heterogeneity-penalized model-averaging method is an additional robust method for Mendelian randomization with excellent theoretical and practical properties, and can reveal features in the data such as the presence of multiple causal mechanisms.
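In rough outline (with a weighting and mixture form that differ from the published method), the estimation step looks like the sketch below: every subset of variants contributes a normal kernel centred at its IVW estimate, up-weighted by subset size and sharply down-weighted by the subset's heterogeneity, and the causal estimate is the mode of the resulting mixture.

    # Rough sketch of heterogeneity-penalized model averaging: each subset of
    # variants contributes a kernel at its IVW estimate, up-weighted by size
    # and heavily down-weighted by its heterogeneity (a Q-type statistic).
    import numpy as np
    from itertools import combinations

    def hp_mode_estimate(ratios, ses):
        grid = np.linspace(ratios.min() - 0.2, ratios.max() + 0.2, 2001)
        density = np.zeros_like(grid)
        for k in range(2, len(ratios) + 1):
            for subset in combinations(range(len(ratios)), k):
                r, s = ratios[list(subset)], ses[list(subset)]
                w = 1.0 / s ** 2
                est = np.sum(w * r) / np.sum(w)          # subset IVW estimate
                q = np.sum(w * (r - est) ** 2)           # subset heterogeneity
                weight = k * np.exp(-q)                  # penalize heterogeneous subsets
                density += weight * np.exp(-0.5 * np.sum(w) * (grid - est) ** 2)
        return grid[np.argmax(density)]                  # mode of the weighted mixture

    ratios = np.array([0.48, 0.52, 0.50, 1.20])          # per-variant causal ratios; last is invalid
    ses = np.array([0.05, 0.05, 0.06, 0.05])
    print(hp_mode_estimate(ratios, ses))                 # mode lands near the valid plurality (~0.50)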
Selecting likely causal risk factors from high-throughput experiments using multivariable Mendelian randomization
Modern high-throughput experiments provide a rich resource to investigate causal determinants of disease risk. Mendelian randomization (MR) is the use of genetic variants as instrumental variables to infer the causal effect of a specific risk factor on an outcome. Multivariable MR is an extension of the standard MR framework to consider multiple potential risk factors in a single model. However, current implementations of multivariable MR use standard linear regression and hence perform poorly with many risk factors. Here, we propose a two-sample multivariable MR approach based on Bayesian model averaging (MR-BMA) that scales to high-throughput experiments. In a realistic simulation study, we show that MR-BMA can detect true causal risk factors even when the candidate risk factors are highly correlated. We illustrate MR-BMA by analysing publicly available summarized data on metabolites to prioritise likely causal biomarkers for age-related macular degeneration.
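In spirit, the model-averaging step resembles the sketch below, where a BIC approximation stands in for the method's actual priors and marginal likelihoods: each subset of risk factors is scored by how well genetic associations with the risk factors explain genetic associations with the outcome, and each risk factor is ranked by its summed score across subsets.

    # Sketch of the MR-BMA idea: regress genetic associations with the outcome
    # on genetic associations with each subset of risk factors, score every
    # subset, and rank risk factors by their marginal (summed) model score.
    # BIC replaces the method's actual marginal likelihood; purely illustrative.
    import numpy as np
    from itertools import combinations

    def marginal_inclusion_scores(beta_X, beta_y):
        n, p = beta_X.shape
        bics = {}
        for k in range(1, p + 1):
            for subset in combinations(range(p), k):
                Xs = beta_X[:, list(subset)]
                coef, rss, *_ = np.linalg.lstsq(Xs, beta_y, rcond=None)
                rss = rss[0] if rss.size else np.sum((beta_y - Xs @ coef) ** 2)
                bics[subset] = n * np.log(rss / n) + k * np.log(n)
        bmin = min(bics.values())
        weights = {m: np.exp(-0.5 * (b - bmin)) for m, b in bics.items()}
        total = sum(weights.values())
        marginal = np.zeros(p)
        for subset, w in weights.items():
            for j in subset:
                marginal[j] += w / total                 # marginal score per risk factor
        return marginal

    rng = np.random.default_rng(3)
    beta_X = rng.normal(size=(40, 3))                    # 40 variants, 3 candidate risk factors
    beta_y = 0.4 * beta_X[:, 0] + rng.normal(scale=0.05, size=40)   # only factor 0 is causal
    print(marginal_inclusion_scores(beta_X, beta_y))     # factor 0 should dominate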
High-throughput multivariable Mendelian randomization analysis prioritizes apolipoprotein B as key lipid risk factor for coronary artery disease.
BACKGROUND: Genetic variants can be used to prioritize risk factors as potential therapeutic targets via Mendelian randomization (MR). An agnostic statistical framework using Bayesian model averaging (MR-BMA) can disentangle the causal role of correlated risk factors with shared genetic predictors. Here, our objective is to identify lipoprotein measures as mediators between lipid-associated genetic variants and coronary artery disease (CAD) for the purpose of detecting therapeutic targets for CAD. METHODS: As risk factors we consider 30 lipoprotein measures and metabolites derived from a high-throughput metabolomics study including 24 925 participants. We fit multivariable MR models of genetic associations with CAD, estimated in 453 595 participants (including 113 937 cases), regressed on genetic associations with the risk factors. MR-BMA assigns to each combination of risk factors a model score quantifying how well the genetic associations with CAD are explained. Risk factors are ranked by their marginal score and selected using false-discovery rate (FDR) criteria. We perform supplementary and sensitivity analyses varying the dataset for genetic associations with CAD. RESULTS: In the main analysis, the top combination of risk factors ranked by the model score contains apolipoprotein B (ApoB) only. ApoB is also the highest ranked risk factor with respect to the marginal score (FDR < 0.005). Additionally, ApoB is selected in all sensitivity analyses. No other measure of cholesterol or triglyceride is consistently selected. CONCLUSIONS: Our agnostic genetic investigation prioritizes ApoB across all datasets considered, suggesting that ApoB, representing the total number of hepatic-derived lipoprotein particles, is the primary lipid determinant of CAD.